library(data.table)
library(dplyr)
library(ggplot2)
X2016 <- read.csv("world-happiness-2/2016.csv")
data = as.data.table(X2016)
names = c("Country", "Region", "HappinessRank", "Score", "LowerCI", "UpperCI", "GDP", "Family", "LifeExp", "Freedom", "Corruption", "Generosity", "DystopianResidual")
names(data) = names
First, we need to import the raw CSV file, read it in as a data frame, and convert it to a data table in order to use R packages such as dplyr. Next, we rename the columns with appropriate headers and validated the classes of each column, which were the variables GDP, freedom, life expectancy, family, corruption, generosity, and dystopian residual. In order to validate the data, we had to make sure variables such as country and region were factors with 157 and 10 levels, respectively, representing the 157 countries and 10 regions of the world. Happiness rank is an integer representing the rank of the country’s happiness score, given as a num. All other variables, excluding country, region, and happiness rank, are supposed to be numeric decimal values. The descriptions of the columns are given as:
summary(data)
## Country Region HappinessRank
## Afghanistan: 1 Sub-Saharan Africa :38 Min. : 1.00
## Albania : 1 Central and Eastern Europe :29 1st Qu.: 40.00
## Algeria : 1 Latin America and Caribbean :24 Median : 79.00
## Angola : 1 Western Europe :21 Mean : 78.98
## Argentina : 1 Middle East and Northern Africa:19 3rd Qu.:118.00
## Armenia : 1 Southeastern Asia : 9 Max. :157.00
## (Other) :151 (Other) :17
## Score LowerCI UpperCI GDP
## Min. :2.905 Min. :2.732 Min. :3.078 Min. :0.0000
## 1st Qu.:4.404 1st Qu.:4.327 1st Qu.:4.465 1st Qu.:0.6702
## Median :5.314 Median :5.237 Median :5.419 Median :1.0278
## Mean :5.382 Mean :5.282 Mean :5.482 Mean :0.9539
## 3rd Qu.:6.269 3rd Qu.:6.154 3rd Qu.:6.434 3rd Qu.:1.2796
## Max. :7.526 Max. :7.460 Max. :7.669 Max. :1.8243
##
## Family LifeExp Freedom Corruption
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.6418 1st Qu.:0.3829 1st Qu.:0.2575 1st Qu.:0.06126
## Median :0.8414 Median :0.5966 Median :0.3975 Median :0.10547
## Mean :0.7936 Mean :0.5576 Mean :0.3710 Mean :0.13762
## 3rd Qu.:1.0215 3rd Qu.:0.7299 3rd Qu.:0.4845 3rd Qu.:0.17554
## Max. :1.1833 Max. :0.9528 Max. :0.6085 Max. :0.50521
##
## Generosity DystopianResidual
## Min. :0.0000 Min. :0.8179
## 1st Qu.:0.1546 1st Qu.:2.0317
## Median :0.2225 Median :2.2907
## Mean :0.2426 Mean :2.3258
## 3rd Qu.:0.3119 3rd Qu.:2.6646
## Max. :0.8197 Max. :3.8377
##
plot(data)
datasubset = data[, c(4, 7,8,9,10,11,12)]
cor(datasubset)
## Score GDP Family LifeExp Freedom
## Score 1.0000000 0.79032202 0.73925158 0.76538433 0.5668267
## GDP 0.7903220 1.00000000 0.66953969 0.83706723 0.3622828
## Family 0.7392516 0.66953969 1.00000000 0.58837678 0.4502082
## LifeExp 0.7653843 0.83706723 0.58837678 1.00000000 0.3411993
## Freedom 0.5668267 0.36228285 0.45020820 0.34119929 1.0000000
## Corruption 0.4020322 0.29418478 0.21356094 0.24958329 0.5020540
## Generosity 0.1568478 -0.02553066 0.08962885 0.07598731 0.3617513
## Corruption Generosity
## Score 0.4020322 0.15684780
## GDP 0.2941848 -0.02553066
## Family 0.2135609 0.08962885
## LifeExp 0.2495833 0.07598731
## Freedom 0.5020540 0.36175133
## Corruption 1.0000000 0.30592986
## Generosity 0.3059299 1.00000000
First, we would like to see the summary statistics for each variable and plot the pairwise relationships to make initial observations. We notice that the ranges of variables including generosity, corruption, and freedom, are not as large as the ranges of variables like GDP, family, and life expectancy.
From the plots of the variables as well as the correlation table, we can see some relationships with linear trend and strong correlation, so we must investigate further.
We can visualize the breakdown of the Happiness Scores for each Region of the world, and compare the means of each Region to the countries within it, as well as with other Regions.
It appears as if the Australia and New Zealand Region has the highest mean Happiness Score of 7.32, and the Sub-Subharan Africa Region has the lowest mean Score of 4.14. We can see a large range of scores within each region such as the Latin America and Caribbean and Middle East and Northern Africa Regions due to a large number of countries in that region, and smaller ranges for the Australia and New Zealand and North America Regions due to only a few countries being in that region.
We want to break down these statistics by Region even further, so we can look at how each factor such as GDP, freedom, life expectancy, family, corruption, generosity and dystopian residual can be influenced by the Region and see if that can show how the overall happiness score is affected.
All the plots below show how much each variable contributes to the overall Happiness Score.
ovr = data %>% group_by(Region) %>% summarise(GDP = mean(GDP))
ggplot(data, aes(Region, GDP, color = Region)) + geom_point()+ geom_point(data = ovr, size = 4, alpha = .8) + geom_text(data = ovr, aes(label = round(GDP, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much GDP Contributes to Happiness")
ovr = data %>% group_by(Region) %>% summarise(Family = mean(Family))
ggplot(data, aes(Region, Family, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8) + geom_text(data = ovr, aes(label = round(Family, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Family Contributes to Happiness")
ovr = data %>% group_by(Region) %>% summarise(LifeExp = mean(LifeExp))
ggplot(data, aes(Region, LifeExp, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8) + geom_text(data = ovr, aes(label = round(LifeExp, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Life Expectency Contributes to Happiness")
ovr = data %>% group_by(Region) %>% summarise(Freedom = mean(Freedom))
ggplot(data, aes(Region, Freedom, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8) + geom_text(data = ovr, aes(label = round(Freedom, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Freedom Contributes to Happiness")
ovr = data %>% group_by(Region) %>% summarise(Corruption = mean(Corruption))
ggplot(data, aes(Region, Corruption, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8) + geom_text(data = ovr, aes(label = round(Corruption, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Government Corruption Contributes to Happiness")
ovr = data %>% group_by(Region) %>% summarise(Generosity = mean(Generosity))
ggplot(data, aes(Region, Generosity, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8) + geom_text(data = ovr, aes(label = round(Generosity, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Generosity Contributes to Happiness")
ovr = data %>% group_by(Region) %>% summarise(DystopianResidual = mean(DystopianResidual))
ggplot(data, aes(Region, DystopianResidual, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8) + geom_text(data = ovr, aes(label = round(DystopianResidual, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much the Dystopian Residual Contributes to Happiness")
We observe that the GDP, family, life expectancy, and freedom variables show a clear difference between the means and ranges of factors within the regions. Other variables like corruption, generosity, and dystopian residual do not differ as much region to region. Now that we have seen how each variable contributes to the happiness score in each region, we can look at the relationship between a country’s score and how much each factor contributes to the score. We can also see if a variable appears to have a strong linear trend with respect to a country’s score.
ggplot(data, aes(x = GDP, y = Score)) + geom_point(aes(color = Region)) + geom_smooth(method = "loess") + labs(title = "GDP vs Score")
ggplot(data, aes(x = LifeExp, y = Score)) + geom_point(aes(color = Region)) + geom_smooth(method = "loess") + labs(title = "Life Expectancy vs Score")
ggplot(data, aes(x = Freedom, y = Score)) + geom_point(aes(color = Region)) + geom_smooth(method = "loess") + labs(title = "Freedom vs Score")
ggplot(data, aes(x = Corruption, y = Score)) + geom_point(aes(color = Region)) + geom_smooth(method = "loess") + labs(title = "Corruption vs Score")
ggplot(data, aes(x = Generosity, y = Score)) + geom_point(aes(color = Region)) + geom_smooth(method = "loess") + labs(title = "Generosity vs Score")
ggplot(data, aes(x = DystopianResidual, y = Score)) + geom_point(aes(color = Region)) + geom_smooth(method = "loess") + labs(title = "Dystopian Residual vs Score") + geom_vline(xintercept = 1.85)
After examining these plots, we can see that GDP, family, life expectancy, and freedom all appear to have a strongly correlated and positive linear trends from their loess regression plots. Although the extremes of these variables do not seem to follow the same linear trends, the points appear to generally follow a positive linear trend. For examples, the higher the GDP contribution, the higher the happiness score of a country is likely to be.
Corruption, generosity, and dystopian residual do not appear to have any linear trend or a strong correlation as seen from the regression lines as well as the spray of the points for the country’s variable. The loess line is a curve, and for the same corruption score value, there is a large range in overall happiness score. For example, for a corruption score of 0.1, there are countries with a happiness score between 3 and 7.
Now that we know more information about how each factor affects score, we can try to create multilinear regression models using forward selection and stepwise selection in both directions.
We want to create a model to predict the happiness score from the factors: GDP, freedom, life expectancy, family, corruption, and generosity. As mentioned earlier, dystopian residual is excluded as it is used more as a comparison statistic rather than using it to predict a happiness score.
Using forward selection and the F test statistic, we add the variable with the largest F value (smallest p-value) to the model in order to minimize the AIC value. The process of adding variables to a model one-by-one is shown below:
score_fit = lm(Score ~ 1, data = data)
indep.vars = ~ GDP + Family + LifeExp + Freedom + Corruption + Generosity
add1(score_fit, indep.vars, test = "F")
## Single term additions
##
## Model:
## Score ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 203.333 42.600
## GDP 1 127.004 76.330 -109.226 257.9027 < 2.2e-16 ***
## Family 1 111.120 92.213 -79.547 186.7808 < 2.2e-16 ***
## LifeExp 1 119.115 84.218 -93.786 219.2273 < 2.2e-16 ***
## Freedom 1 65.329 138.004 -16.247 73.3752 1.005e-14 ***
## Corruption 1 32.865 170.469 16.922 29.8826 1.798e-07 ***
## Generosity 1 5.002 198.331 40.690 3.9094 0.04979 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + GDP)
add1(score_fit, indep.vars, test = "F")
## Single term additions
##
## Model:
## Score ~ GDP
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 76.330 -109.23
## Family 1 16.2683 60.061 -144.86 41.713 1.312e-09 ***
## LifeExp 1 7.3238 69.006 -123.06 16.344 8.320e-05 ***
## Freedom 1 18.4162 57.913 -150.58 48.971 7.490e-11 ***
## Corruption 1 6.3977 69.932 -120.97 14.089 0.0002466 ***
## Generosity 1 6.3762 69.953 -120.92 14.037 0.0002529 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + Freedom)
add1(score_fit, indep.vars, test = "F")
## Single term additions
##
## Model:
## Score ~ GDP + Freedom
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 57.913 -150.58
## Family 1 8.2876 49.626 -172.82 25.5514 1.215e-06 ***
## LifeExp 1 5.7291 52.184 -164.93 16.7973 6.727e-05 ***
## Corruption 1 0.4853 57.428 -149.90 1.2929 0.2573
## Generosity 1 0.7921 57.121 -150.74 2.1216 0.1473
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + Family)
add1(score_fit, indep.vars, test = "F")
## Single term additions
##
## Model:
## Score ~ GDP + Freedom + Family
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 49.626 -172.82
## LifeExp 1 5.0887 44.537 -187.81 17.3672 5.153e-05 ***
## Corruption 1 1.1562 48.470 -174.52 3.6257 0.05878 .
## Generosity 1 0.6567 48.969 -172.91 2.0383 0.15543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + LifeExp)
add1(score_fit, indep.vars, test = "F")
## Single term additions
##
## Model:
## Score ~ GDP + Freedom + Family + LifeExp
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 44.537 -187.81
## Corruption 1 1.27522 43.262 -190.37 4.4510 0.03653 *
## Generosity 1 0.20506 44.332 -186.53 0.6985 0.40462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + Corruption)
add1(score_fit, indep.vars, test = "F")
## Single term additions
##
## Model:
## Score ~ GDP + Freedom + Family + LifeExp + Corruption
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 43.262 -190.37
## Generosity 1 0.055882 43.206 -188.57 0.194 0.6602
summary(score_fit)
##
## Call:
## lm(formula = Score ~ GDP + Freedom + Family + LifeExp + Corruption,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48327 -0.28172 -0.02771 0.32803 1.46147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2119 0.1502 14.731 < 2e-16 ***
## GDP 0.6971 0.2094 3.329 0.0011 **
## Freedom 1.5588 0.3733 4.175 5.01e-05 ***
## Family 1.2344 0.2289 5.393 2.62e-07 ***
## LifeExp 1.4623 0.3430 4.263 3.53e-05 ***
## Corruption 0.9590 0.4546 2.110 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5353 on 151 degrees of freedom
## Multiple R-squared: 0.7872, Adjusted R-squared: 0.7802
## F-statistic: 111.7 on 5 and 151 DF, p-value: < 2.2e-16
As shown in the summary, the final model is Score ~ GDP + Freedom + Family + LifeExp + Corruption, with a p-value of < 2.2e-16, which is statistically significant.
In order to see if there are other possible models, we can use the step function, which adds or removes the independent variables GDP, Family, LifeExp, Freedom, Corruption, and Generosity to the model to minimize AIC.
score_aic = lm(Score ~ 1, data = data)
indep.vars = ~ GDP + Family + LifeExp + Freedom + Corruption + Generosity
score_out = step(score_aic, scope = indep.vars, direction = "both")
## Start: AIC=42.6
## Score ~ 1
##
## Df Sum of Sq RSS AIC
## + GDP 1 127.004 76.330 -109.226
## + LifeExp 1 119.115 84.218 -93.786
## + Family 1 111.120 92.213 -79.547
## + Freedom 1 65.329 138.004 -16.247
## + Corruption 1 32.865 170.469 16.922
## + Generosity 1 5.002 198.331 40.690
## <none> 203.333 42.600
##
## Step: AIC=-109.23
## Score ~ GDP
##
## Df Sum of Sq RSS AIC
## + Freedom 1 18.416 57.913 -150.58
## + Family 1 16.268 60.061 -144.86
## + LifeExp 1 7.324 69.006 -123.06
## + Corruption 1 6.398 69.932 -120.97
## + Generosity 1 6.376 69.953 -120.92
## <none> 76.330 -109.23
## - GDP 1 127.004 203.333 42.60
##
## Step: AIC=-150.58
## Score ~ GDP + Freedom
##
## Df Sum of Sq RSS AIC
## + Family 1 8.288 49.626 -172.823
## + LifeExp 1 5.729 52.184 -164.930
## + Generosity 1 0.792 57.121 -150.738
## <none> 57.913 -150.576
## + Corruption 1 0.485 57.428 -149.897
## - Freedom 1 18.416 76.330 -109.226
## - GDP 1 80.090 138.004 -16.247
##
## Step: AIC=-172.82
## Score ~ GDP + Freedom + Family
##
## Df Sum of Sq RSS AIC
## + LifeExp 1 5.0887 44.537 -187.81
## + Corruption 1 1.1562 48.470 -174.52
## + Generosity 1 0.6567 48.969 -172.91
## <none> 49.626 -172.82
## - Family 1 8.2876 57.913 -150.58
## - Freedom 1 10.4355 60.061 -144.86
## - GDP 1 28.6222 78.248 -103.33
##
## Step: AIC=-187.81
## Score ~ GDP + Freedom + Family + LifeExp
##
## Df Sum of Sq RSS AIC
## + Corruption 1 1.2752 43.262 -190.37
## <none> 44.537 -187.81
## + Generosity 1 0.2051 44.332 -186.53
## - GDP 1 3.8695 48.407 -176.73
## - LifeExp 1 5.0887 49.626 -172.82
## - Family 1 7.6472 52.184 -164.93
## - Freedom 1 9.5958 54.133 -159.17
##
## Step: AIC=-190.37
## Score ~ GDP + Freedom + Family + LifeExp + Corruption
##
## Df Sum of Sq RSS AIC
## <none> 43.262 -190.37
## + Generosity 1 0.0559 43.206 -188.57
## - Corruption 1 1.2752 44.537 -187.81
## - GDP 1 3.1742 46.436 -181.25
## - Freedom 1 4.9945 48.256 -175.22
## - LifeExp 1 5.2078 48.470 -174.52
## - Family 1 8.3320 51.594 -164.72
summary(score_out)
##
## Call:
## lm(formula = Score ~ GDP + Freedom + Family + LifeExp + Corruption,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48327 -0.28172 -0.02771 0.32803 1.46147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2119 0.1502 14.731 < 2e-16 ***
## GDP 0.6971 0.2094 3.329 0.0011 **
## Freedom 1.5588 0.3733 4.175 5.01e-05 ***
## Family 1.2344 0.2289 5.393 2.62e-07 ***
## LifeExp 1.4623 0.3430 4.263 3.53e-05 ***
## Corruption 0.9590 0.4546 2.110 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5353 on 151 degrees of freedom
## Multiple R-squared: 0.7872, Adjusted R-squared: 0.7802
## F-statistic: 111.7 on 5 and 151 DF, p-value: < 2.2e-16
As we can see from the output of the stepwise function, we get the same model as before: Score ~ GDP + Freedom + Family + LifeExp + Corruption. This model uses all of the factors except generosity, which is not statistically significant to include. We can see how well this model fits the data by observing the plots below:
par(mfrow = c(2,2))
plot(score_fit)
critval = qt(0.05/(2*nobs(score_fit)), df=df.residual(score_fit)-1, lower=FALSE)
which(abs(rstudent(score_fit)) > critval)
## named integer(0)
The residuals plot shows that points are equally spread and fall on a flat line, so the constant variance assumption appears to be satisfied. The points on the Normal Q-Q plot generally fall on the line, except for at the extremes. This means the normality assumption also holds. There are no outliers or influential points, as shown in the plots and the Bonferroni adjusted critical value.
plot(data$Score, score_fit$fitted.values)
abline(0, 1, col = "red")
This plot confirms that the model’s predicted values closely match the observed happiness scores when using the six predictor variables to predict scores. We conclude that the final model Score ~ GDP + Freedom + Family + LifeExp + Corruption fits our data well.